# Summary of the New Curriculum Test Paper

The New Curriculum test papers are used in 29 provinces and municipalities, including Guangdong, Fujian, Hubei, Hunan, Jiangsu, Hebei, Shandong, Zhejiang, and Henan. This evaluation used the New Curriculum I paper for Chinese, Mathematics, and English, together with the New Curriculum comprehensive arts and comprehensive science papers (used in Shanxi, Henan, Yunnan, Tibet, and Xinjiang). The specific questions can be viewed on [Gaokao Direct](https://easylearn.baidu.com/gaokao/content/list?tabKey=question).

# Assessment

During the assessment, each model's answers were randomly labeled A, B, C, D, E, F, or G before being handed to the teachers for scoring. Scoring followed these criteria:

- For Chinese, Mathematics, and English, images were discarded and the models reasoned over text only (few questions in these subjects contain images).
- For single-choice and fill-in-the-blank questions in every subject, points were awarded only if the answer matched the reference answer exactly.
- For multiple-choice questions in Mathematics, points were awarded in proportion to the correct options selected; selecting any incorrect option earned no points.
- Subjective questions were scored step by step, with partial credit given for correct steps.
- Essay questions were scored according to the essay grading standards.
- Questions with images in the comprehensive arts and science papers were answered by the multimodal model of the same series. Models that exist only as text versions, such as Mixtral, were scored on their answers without images. For Qwen2, only the QwenVL-7B multimodal version had been released, so the Qwen multimodal results may differ from the actual capabilities of the series.
- QwenVL-7B answered poorly; in the geography section, for example, it received only 4 points. To better reflect the true level of the Qwen series, we also assessed the Qwen2-72B text model's responses to the image-based questions in physics, chemistry, and geography.

Additionally, to make the results reproducible, every model generated all answers except essays with greedy decoding.
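
The rule for multiple-choice Mathematics questions listed above can be sketched in a few lines. This is a minimal illustration rather than the graders' actual procedure: the function name `score_multiple_choice` is hypothetical, and the sketch assumes that "ratio of correct options" means the number of correctly selected options divided by the number of options in the reference answer.

```python
def score_multiple_choice(selected: str, reference: str, full_marks: float) -> float:
    """Score one multiple-choice question under the assumed partial-credit rule."""
    chosen = set(selected.upper())
    correct = set(reference.upper())
    # An empty answer, or any option outside the reference answer, earns nothing.
    if not chosen or not chosen <= correct:
        return 0.0
    # Otherwise award marks in proportion to the correct options selected.
    return full_marks * len(chosen) / len(correct)


if __name__ == "__main__":
    # Hypothetical answers to a 6-point question whose reference answer is ACD.
    for answer in ("ACD", "AC", "A", "AB"):
        print(answer, score_multiple_choice(answer, "ACD", 6))
```

Under this assumption, a partially correct answer earns a fraction of the marks, while a single wrong option forfeits the whole question.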

## Overall Scores

The overall scores of the participating models are as follows:

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: center;">
      <th colspan="13" style="text-align: center;">New Curriculum Standard&dagger; (sorted by total science score)</th>
    </tr>
  </thead>
  <tbody>
    <tr style="text-align: center;">
      <td>Model</td>
      <td>Research Institution</td>
      <td>Chinese</td>
      <td>Mathematics</td>
      <td>English</td>
      <td>Physics</td>
      <td>Chemistry</td>
      <td>Biology</td>
      <td>History</td>
      <td>Geography</td>
      <td>Political Science</td>
      <td>Total Science Score</td>
      <td>Total Humanities Score</td>
	</tr>
    <tr style="text-align: center;">
      <td>WQX+VL-20B</td>
      <td>Ours</td>
      <td>112</td>
      <td>74</td>
      <td>138.5</td>
      <td>39</td>
      <td>48</td>
      <td>57</td>
      <td>82</td>
      <td>58</td>
      <td>67</td>
      <td>468.5</td>
      <td>531.5</td>
    </tr>
      <tr style="text-align: center;">
      <td>GPT-4o</td>
      <td>OpenAI</td>
      <td>111.5</td>
      <td>73</td>
      <td>141.5</td>
      <td>36</td>
      <td>40</td>
      <td>65</td>
      <td>88</td>
      <td>59</td>
      <td>58</td>
      <td>467</td>
      <td>531</td>
    </tr>
    <tr style="text-align: center;">
      <td>Qwen2-72B text only</td>
      <td>Alibaba</td>
      <td>124</td>
      <td>68</td>
      <td>139</td>
      <td>42</td>
      <td>44</td>
      <td>48</td>
      <td>85</td>
      <td>70</td>
      <td>60</td>
      <td>465</td>
      <td>546</td>
    </tr>
    <tr style="text-align: center;">
      <td>Qwen2-72B+VL-7B</td>
      <td>Alibaba</td>
      <td>124</td>
      <td>68</td>
      <td>139</td>
      <td>19</td>
      <td>6</td>
      <td>48</td>
      <td>85</td>
      <td>4</td>
      <td>60</td>
      <td>404</td>
      <td>480</td>
    </tr>
    <tr style="text-align: center;">
      <td>Yi-34B+VL-34B</td>
      <td>01.AI</td>
      <td>97</td>
      <td>31</td>
      <td>134.5</td>
      <td>21</td>
      <td>37</td>
      <td>49</td>
      <td>48</td>
      <td>41</td>
      <td>51</td>
      <td>369.5</td>
      <td>402.5</td>
    </tr>
    <tr style="text-align: center;">
      <td>Qwen2-57B+VL-7B</td>
      <td>Alibaba</td>
      <td>99.5</td>
      <td>58</td>
      <td>126.5</td>
      <td>7</td>
      <td>6</td>
      <td>51</td>
      <td>73</td>
      <td>4</td>
      <td>62</td>
      <td>348</td>
      <td>423</td>
    </tr>
    <tr style="text-align: center;">
      <td>GLM4-9B+VL-9B</td>
      <td>Zhipu AI</td>
      <td>86</td>
      <td>48</td>
      <td>97</td>
      <td>18</td>
      <td>27</td>
      <td>67</td>
      <td>80</td>
      <td>62</td>
      <td>48</td>
      <td>343</td>
      <td>421</td>
    </tr>
    <tr style="text-align: center;">
      <td>Mixtral 8x22B</td>
      <td>Mistral</td>
      <td>77.5</td>
      <td>21</td>
      <td>116.5</td>
      <td>25</td>
      <td>35</td>
      <td>46</td>
      <td>54</td>
      <td>56</td>
      <td>38</td>
      <td>321</td>
      <td>363</td>
    </tr>
  </tbody>
</table>

&dagger; indicates that the assessment uses the New Curriculum Standard I Paper for Chinese, Mathematics, and English, along with the New Curriculum Standard Paper for Arts and Science Comprehensive Tests.

An asterisk (\*) after a question number indicates that the question contains images. A model name that includes "+VL" means that questions involving images were answered by the corresponding multimodal version of the model. Without "+VL", only text-based inference was performed and images were ignored.
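
As a quick consistency check on the overall table, the two totals can be recomputed from the subject scores. The sketch below assumes that the science total is Chinese + Mathematics + English + Physics + Chemistry + Biology and that the humanities total is Chinese + Mathematics + English + History + Geography + Political Science. This composition is inferred from the figures rather than stated in the text, and the dictionary layout and the `totals` helper are illustrative only.

```python
CORE = ("Chinese", "Mathematics", "English")
SCIENCE = ("Physics", "Chemistry", "Biology")
HUMANITIES = ("History", "Geography", "Political Science")

# Subject scores copied from two rows of the overall table.
scores = {
    "WQX+VL-20B": {
        "Chinese": 112, "Mathematics": 74, "English": 138.5,
        "Physics": 39, "Chemistry": 48, "Biology": 57,
        "History": 82, "Geography": 58, "Political Science": 67,
    },
    "GPT-4o": {
        "Chinese": 111.5, "Mathematics": 73, "English": 141.5,
        "Physics": 36, "Chemistry": 40, "Biology": 65,
        "History": 88, "Geography": 59, "Political Science": 58,
    },
}


def totals(row):
    """Return (science total, humanities total) for one model's subject scores."""
    core = sum(row[s] for s in CORE)
    science = core + sum(row[s] for s in SCIENCE)
    humanities = core + sum(row[s] for s in HUMANITIES)
    return science, humanities


for model, row in scores.items():
    print(model, totals(row))
```

For both rows the recomputed totals agree with the values listed in the last two columns of the table.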

## Chinese

The scores for each section of the test paper are as follows: 

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th colspan="8"  style="text-align: center;">Score Distribution for Each Question Type in Chinese</th> 
    </tr>
  </thead>
  <tbody>
    <tr style="text-align: center;">
        <td>Model</td> 
        <td>Modern Text Reading (Total Score: 35)</td> 
        <td>Classical Chinese Reading (Total Score: 22)</td> 
        <td>Ancient Poetry Reading (Total Score: 9)</td> 
        <td>Memorization of Famous Works and Quotations (Total Score: 6)</td> 
        <td>Language and Text Application (Total Score: 18)</td> 
        <td>Essay (Total Score: 60)</td> 
        <td>Total Score (Maximum: 150)</td> 
    </tr>
    <tr style="text-align: center;">
      <td>Qwen2-72B</td>
      <td>31</td>
      <td>19</td>
      <td>9</td>
      <td>6</td>
      <td>9</td>
      <td>50</td>
      <td>124</td>
    </tr>
    <tr style="text-align: center;">
      <td>WQX-20B</td>
      <td>30</td>
      <td>17</td>
      <td>6</td>
      <td>6</td>
      <td>7</td>
      <td>46</td>
      <td>112</td>
    </tr>
    <tr style="text-align: center;">
      <td>GPT-4o</td>
      <td>32</td>
      <td>10</td>
      <td>8</td>
      <td>2</td>
      <td>9</td>
      <td>50.5</td>
      <td>111.5</td>
    </tr>
    <tr style="text-align: center;">
      <td>Qwen2-57B</td>
      <td>27</td>
      <td>12</td>
      <td>7</td>
      <td>6</td>
      <td>2</td>
      <td>45.5</td>
      <td>99.5</td>
    </tr>        
    <tr style="text-align: center;">
      <td>Yi-1.5-34B</td>
      <td>28</td>
      <td>8</td>
      <td>5</td>
      <td>2</td>
      <td>4</td>
      <td>50</td>
      <td>97</td>
    </tr>
    <tr style="text-align: center;">
      <td>GLM4-9B</td>
      <td>21</td>
      <td>6</td>
      <td>8</td>
      <td>6</td>
      <td>4</td>
      <td>41</td>
      <td>86</td>
    </tr>
    <tr style="text-align: center;">
      <td>Mixtral 8x22B</td>
      <td>18</td>
      <td>3</td>
      <td>7</td>
      <td>2</td>
      <td>3</td>
      <td>44.5</td>
      <td>77.5</td>
    </tr>
  </tbody>
</table>


Encouragingly, the leading domestic models hold an advantage over GPT-4o in Chinese. By question type, the models average a score rate above 70% in modern text reading, ancient poetry reading, memorization of famous works and quotations, and essay writing. In classical Chinese reading and language and text application, however, the average score rates are below 50% and roughly 30%, respectively. Language and text application is a particular weakness: many large models struggle to understand and follow the requirements of the questions, and none of the models understood the technique of "metonymy" or met the 15-character length requirement. In essay writing, the models' essays show distinctive textual habits, such as a tendency to begin paragraphs with "first," "second," and "then."

The scores for each question in the test paper are as follows: 

<table border="1">
    <tr style="text-align: center;">
        <th rowspan="2">Chinese</th>
        <th rowspan="2">Question number</th>
<th colspan="5">Modern Text Reading I</th><th colspan="4">Modern Text Reading II</th><th colspan="5">Classical Chinese Reading</th><th colspan="2">Classical Poetry and Prose Reading</th><th colspan="1">Memorization of Famous Works and Quotes</th><th colspan="5">Language and Text Application</th><th colspan="1">Essay</th><th rowspan="2">Total Score</th></tr>
<tr style="text-align: center;"><th>1.1</th><th>1.2</th><th>1.3</th><th>1.4</th><th>1.5</th><th>2.1</th><th>2.2</th><th>2.3</th><th>2.4</th><th>3.1</th><th>3.2</th><th>3.3</th><th>3.4</th><th>3.5</th><th>4.1</th><th>4.2</th><th>5.0</th><th>6.1</th><th>6.2</th><th>6.3</th><th>6.4</th><th>6.5</th><th>7.0</th></tr><tr style="text-align: center;"><td>Test Model</td><td>Score</td><td>3.0</td><td>3.0</td><td>3.0</td><td>4.0</td><td>6.0</td><td>3.0</td><td>3.0</td><td>4.0</td><td>6.0</td><td>3.0</td><td>3.0</td><td>3.0</td><td>8.0</td><td>5.0</td><td>3.0</td><td>6.0</td><td>6.0</td><td>5.0</td><td>2.0</td><td>4.0</td><td>4.0</td><td>3.0</td><td>60.0</td><td>150.0(100%)</td></tr><tr style="text-align: center;"><td>Qwen2-72B</td><td></td><td>3.0</td><td>3.0</td><td>3.0</td><td>4.0</td><td>5.0</td><td>3.0</td><td>3.0</td><td>3.0</td><td>4.0</td><td>2.0</td><td>3.0</td><td>3.0</td><td>6.0</td><td>5.0</td><td>3.0</td><td>6.0</td><td>6.0</td><td>0.0</td><td>1.0</td><td>1.0</td><td>4.0</td><td>3.0</td><td>50.0</td><td>124.0(82.6%)</td></tr><tr style="text-align: center;"><td>WQX-20B</td><td></td><td>3.0</td><td>3.0</td><td>3.0</td><td>3.0</td><td>4.0</td><td>3.0</td><td>3.0</td><td>3.0</td><td>5.0</td><td>3.0</td><td>0.0</td><td>3.0</td><td>7.0</td><td>4.0</td><td>3.0</td><td>3.0</td><td>6.0</td><td>0.0</td><td>2.0</td><td>2.0</td><td>2.0</td><td>1.0</td><td>46.0</td><td>112.0(74.6%)</td></tr><tr style="text-align: center;"><td>GPT-4o	</td><td></td><td>3.0</td><td>3.0</td><td>3.0</td><td>3.0</td><td>6.0</td><td>3.0</td><td>3.0</td><td>4.0</td><td>4.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>5.0</td><td>5.0</td><td>3.0</td><td>5.0</td><td>2.0</td><td>0.0</td><td>2.0</td><td>0.0</td><td>4.0</td><td>3.0</td><td>50.5</td><td>111.5(74.3%)</td></tr><tr style="text-align: center;"><td>Qwen2-57B	
</td><td></td><td>3.0</td><td>3.0</td><td>3.0</td><td>2.0</td><td>3.0</td><td>3.0</td><td>3.0</td><td>2.0</td><td>5.0</td><td>2.0</td><td>0.0</td><td>3.0</td><td>5.0</td><td>2.0</td><td>3.0</td><td>4.0</td><td>6.0</td><td>0.0</td><td>1.0</td><td>0.0</td><td>0.0</td><td>1.0</td><td>45.5</td><td>99.5(66.3%)</td></tr><tr style="text-align: center;"><td>Yi-1.5-34B	</td><td></td><td>3.0</td><td>3.0</td><td>3.0</td><td>2.0</td><td>5.0</td><td>3.0</td><td>3.0</td><td>2.0</td><td>4.0</td><td>2.0</td><td>0.0</td><td>0.0</td><td>4.0</td><td>2.0</td><td>3.0</td><td>2.0</td><td>2.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>2.0</td><td>2.0</td><td>50.0</td><td>97.0(64.7%)</td></tr><tr style="text-align: center;"><td>GLM4-9B	</td><td></td><td>3.0</td><td>3.0</td><td>0.0</td><td>1.0</td><td>0.0</td><td>3.0</td><td>3.0</td><td>4.0</td><td>4.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>4.0</td><td>2.0</td><td>3.0</td><td>5.0</td><td>6.0</td><td>0.0</td><td>1.0</td><td>0.0</td><td>0.0</td><td>3.0</td><td>41.0</td><td>86.0(57.3%)</td></tr><tr style="text-align: center;"><td>Mixtral 8x22B</td><td></td><td>0.0</td><td>3.0</td><td>0.0</td><td>1.0</td><td>6.0</td><td>3.0</td><td>0.0</td><td>2.0</td><td>3.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>2.0</td><td>1.0</td><td>3.0</td><td>4.0</td><td>2.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>3.0</td><td>44.5</td><td>77.5(51.7%)</td></tr></table>

## Mathematics

All models in this test are large language models, so images in the questions were discarded and only the text was given to the models (only two Mathematics questions in this test contain images). The scores that different teachers gave to the short answer questions diverged considerably, by as much as 13 points. The main reason is that the models' responses are relatively disorganized, so a grader who does not check them sentence by sentence is easily "deceived" by the formulas. Teachers can be misled by answers that merely look reasonable; in some cases a model followed an incorrect process but still arrived at the correct linear equation, leading some teachers to believe that the question had been answered correctly. We therefore reviewed every short answer question thoroughly and adjusted the scores based on the steps. After the review, the score for each short answer question was 4 to 5 points lower than the average of the teachers' scores.

The scores for each section of the test paper are as follows: 
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th colspan="6"  style="text-align: center;">Score Distribution for Each Question Type in Mathematics</th>
    </tr>
  </thead>
  <tbody>
    <tr style="text-align: center;">
        <td>Model</td>
        <td>Single Choice Questions (Total Score: 40)</td>
        <td>Multiple Choice Questions (Total Score: 18)</td>
        <td>Fill-in-the-Blank Questions (Total Score: 15)</td>
        <td>Short answer questions (Total Score: 77)</td>
        <td>Total Score (Maximum: 150)</td>
    </tr>
    <tr style="text-align: center;">
      <td>WQX-20B</td>
      <td>30</td>
      <td>8*</td>
      <td>10</td>
      <td>26</td>
      <td>74*</td>
    </tr>
    <tr style="text-align: center;">
      <td>GPT-4o</td>
      <td>35</td>
      <td>6</td>
      <td>10</td>
      <td>22</td>
      <td>73</td>
    </tr>
    <tr style="text-align: center;">
      <td>Qwen2-72B</td>
      <td>30</td>
      <td>10*</td>
      <td>10</td>
      <td>18</td>
      <td>68*</td>
    </tr>
    <tr style="text-align: center;">
      <td>Qwen2-57B</td>
      <td>35</td>
      <td>9</td>
      <td>5</td>
      <td>9</td>
      <td>58</td>
    </tr>
    <tr style="text-align: center;">
      <td>GLM4-9B</td>
      <td>30</td>
      <td>6*</td>
      <td>0</td>
      <td>12</td>
      <td>48*</td>
    </tr>
    <tr style="text-align: center;">
      <td>Yi-1.5-34B</td>
      <td>20</td>
      <td>7*</td>
      <td>0</td>
      <td>4</td>
      <td>31*</td>
    </tr>
    <tr style="text-align: center;">
      <td>Mixtral 8x22B</td>
      <td>10</td>
      <td>0</td>
      <td>0</td>
      <td>11</td>
      <td>21</td>
    </tr>
  </tbody>
</table>


Mathematics is the subject with the lowest score rate for all of the LLMs. The short answer questions are especially difficult: across the five questions, the models average a score rate of only 18.9%. In this assessment, WQX-20B achieved the highest score rate on this question type, at 33.8%. None of the evaluated models answered any of these major questions completely and correctly; the better-performing models managed only the first part of a question and small portions of the later parts.
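
The two percentages quoted above follow directly from the "Short answer questions" column of the score distribution table. The sketch below recomputes them; the variable names and the `score_rate` helper are illustrative and not part of the original evaluation.

```python
SHORT_ANSWER_MAX = 77  # maximum marks for the short answer section

# Short answer scores copied from the Mathematics score distribution table.
short_answer_scores = {
    "WQX-20B": 26,
    "GPT-4o": 22,
    "Qwen2-72B": 18,
    "Qwen2-57B": 9,
    "GLM4-9B": 12,
    "Yi-1.5-34B": 4,
    "Mixtral 8x22B": 11,
}


def score_rate(score, maximum):
    """Return the score as a percentage of the maximum."""
    return 100 * score / maximum


rates = {m: score_rate(s, SHORT_ANSWER_MAX) for m, s in short_answer_scores.items()}
average_rate = sum(rates.values()) / len(rates)
best_model = max(rates, key=rates.get)

print(f"average score rate: {average_rate:.1f}%")
print(f"best model: {best_model} at {rates[best_model]:.1f}%")
```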


The scores for each question in the test paper are as follows: 

<table border="1">
    <tr style="text-align: center;">
        <th rowspan="2">Mathematics</th>
        <th rowspan="2">Question number</th>
<th colspan="8">Single Choice Questions</th><th colspan="3">Multiple Choice Questions</th><th colspan="3">Fill-in-the-Blank Questions</th><th colspan="5">Short Answer Questions</th><th rowspan="2">Total Score</th></tr>
<tr style="text-align: center;"><th>1</th><th>2</th><th>3</th><th>4</th><th>5</th><th>6</th><th>7</th><th>8</th><th>9</th><th>10</th><th>11</th><th>12</th><th>13</th><th>14</th><th>15</th><th>16</th><th>17</th><th>18</th><th>19</th></tr><tr style="text-align: center;"><td>Test Model</td><td>Score</td><td>5</td><td>5</td><td>5</td><td>5</td><td>5</td><td>5</td><td>5</td><td>5</td><td>6</td><td>6</td><td>6</td><td>5</td><td>5</td><td>5</td><td>13</td><td>15</td><td>15</td><td>17</td><td>17</td><td>150	(100%)</td></tr><tr style="text-align: center;"><td>WQX-20B</td><td></td><td>5</td><td>0</td><td>5</td><td>5</td><td>5</td><td>5</td><td>5</td><td>0</td><td>0</td><td>2*</td><td>6</td><td>0</td><td>5</td><td>5</td><td>9</td><td>9</td><td>1</td><td>6</td><td>1</td><td>74	*(49.3%)</td></tr><tr style="text-align: center;"><td>GPT-4o</td><td></td><td>5</td><td>5</td><td>5</td><td>5</td><td>5</td><td>5</td><td>5</td><td>0</td><td>0</td><td>0</td><td>6</td><td>5</td><td>5</td><td>0</td><td>6</td><td>7</td><td>5</td><td>3</td><td>1</td><td>73	(48.7%)</td></tr><tr style="text-align: center;"><td>Qwen2-72B</td><td></td><td>5</td><td>0</td><td>5</td><td>5</td><td>5</td><td>5</td><td>5</td><td>0</td><td>0</td><td>4*</td><td>6</td><td>5</td><td>5</td><td>0</td><td>7</td><td>7</td><td>0</td><td>3</td><td>1</td><td>68*	(45.3%)</td></tr><tr style="text-align: center;"><td>Qwen2-57B</td><td></td><td>5</td><td>5</td><td>5</td><td>0</td><td>5</td><td>5</td><td>5</td><td>5</td><td>3</td><td>0</td><td>6</td><td>0</td><td>5</td><td>0</td><td>1</td><td>5</td><td>1</td><td>2</td><td>0</td><td>58	(38.7%)</td></tr><tr style="text-align: center;"><td>GLM4-9B</td><td></td><td>5</td><td>0</td><td>5</td><td>5</td><td>5</td><td>5</td><td>5</td><td>0</td><td>0</td><td>2*</td><td>4</td><td>0</td><td>0</td><td>0</td><td>1</td><td>7</td><td>2</td><td>1</td><td>1</td><td>48*	(32%)</td></tr><tr style="text-align: 
center;"><td>Yi-1.5-34B</td><td></td><td>5</td><td>0</td><td>0</td><td>5</td><td>0</td><td>5</td><td>5</td><td>0</td><td>3</td><td>2*</td><td>2</td><td>0</td><td>0</td><td>0</td><td>0</td><td>1</td><td>0</td><td>3</td><td>0</td><td>31*	(20.7%)</td></tr><tr style="text-align: center;"><td>Mixtral 8x22B</td><td></td><td>5</td><td>0</td><td>5</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>6</td><td>0</td><td>0</td><td>5</td><td>0</td><td>21	(14%)</td></tr></table>


An asterisk (\*) marks a corrected score: after the Gaokao, the reference answer to Question 10 was found to have changed from AC to ACD, and the affected scores were updated accordingly.

## English
The listening section of the English exam (worth 30 points) was not assessed. In the tables below, every model is credited with the full 30 points for listening.

The scores for each section of the test paper are as follows: 

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th colspan="8" style="text-align: center;">Score Distribution for Each Question Type in English</th>
    </tr>
  </thead>
  <tbody>
    <tr style="text-align: center;">
        <td>Model</td> 
        <td>Listening (Total Score: 30)</td> 
        <td>Reading Comprehension (Total Score: 37.5)</td> 
        <td>Choose 5 out of 7 (Total Score: 12.5)</td> 
        <td>Cloze Test (Total Score: 15)</td> 
        <td>Grammar Completion (Total Score: 15)</td> 
        <td>Writing (Total Score: 40)</td> 
        <td>Total Score (Total Score: 150)</td>
    </tr>
    <tr style="text-align: center;">
      <td>GPT-4o</td>
      <td>30</td>
      <td>37.5</td>
      <td>10</td>
      <td>14</td>
      <td>15</td>
      <td>35</td>
      <td>141.5</td>
    </tr>
    <tr style="text-align: center;">
      <td>Qwen2-72B</td>
      <td>30</td>
      <td>35</td>
      <td>12.5</td>
      <td>14</td>
      <td>13.5</td>
      <td>34</td>
      <td>139</td>
    </tr>
    <tr style="text-align: center;">
      <td>WQX-20B</td>
      <td>30</td>
      <td>37.5</td>
      <td>10</td>
      <td>15</td>
      <td>13.5</td>
      <td>32.5</td>
      <td>138.5</td>
    </tr>
    <tr style="text-align: center;">
      <td>Yi-1.5-34B</td>
      <td>30</td>
      <td>35</td>
      <td>10</td>
      <td>11</td>
      <td>13.5</td>
      <td>35</td>
      <td>134.5</td>
    </tr>
    <tr style="text-align: center;">
      <td>Qwen2-57B</td>
      <td>30</td>
      <td>35</td>
      <td>10</td>
      <td>9</td>
      <td>15</td>
      <td>27.5</td>
      <td>126.5</td>
    </tr>
    <tr style="text-align: center;">
      <td>Mixtral 8x22B</td>
      <td>30</td>
      <td>37.5</td>
      <td>5</td>
      <td>2</td>
      <td>9</td>
      <td>33</td>
      <td>116.5</td>
    </tr>
    <tr style="text-align: center;">
      <td>GLM4-9B</td>
      <td>30</td>
      <td>35</td>
      <td>0</td>
      <td>6</td>
      <td>6</td>
      <td>20</td>
      <td>97</td>
    </tr>
  </tbody>
</table>


GPT-4o ranked first in English, and the grading teacher highly praised its essay, deducting only 1 point for word count. Overall, reading comprehension is clearly the strong suit of all the large models, with an average score rate of 96.2%. By contrast, the "choose 5 out of 7" section was the hardest to score in, with an average score rate of 65.7%. Both this section and the cloze test had average score rates below 70%; one important reason is that some models were not well adapted to these question types and produced misaligned output that lost points. In the writing section, most models lost points for exceeding the word limit, whereas human candidates typically lose points for writing too little. The models also tended to produce overly long single paragraphs.

The scores for each question in the test paper are as follows: 

<table border="1">
    <tr style="text-align: center;">
        <th rowspan="2">English</th>
        <th rowspan="2">Question number</th>
        <th colspan="1">Listening</th>
        <th colspan="4">Reading Comprehension</th>
        <th colspan="1">Choose 5 out of 7</th>
        <th colspan="1">Cloze Test</th>
        <th colspan="1">Grammar Completion</th>
        <th colspan="1">Writing - Short essay</th>
        <th colspan="1">Writing - Long essay</th>
        <th rowspan="2">Total Score</th>
    </tr>
    <tr style="text-align: center;">
        <th>0</th>
        <th>1</th>
        <th>2</th>
        <th>3</th>
        <th>4</th>
        <th>5</th>
        <th>6</th>
        <th>7</th>
        <th>8</th>
        <th>9</th>
    </tr>
    <tr style="text-align: center;">
        <td>Test Model</td>
        <td>Score</td>
        <td>30.0</td>
        <td>7.5</td>
        <td>10.0</td>
        <td>10.0</td>
        <td>10.0</td>
        <td>12.5</td>
        <td>15.0</td>
        <td>15.0</td>
        <td>15.0</td>
        <td>25.0</td>
        <td>150.0(100%)</td>
    </tr>
    <tr style="text-align: center;">
        <td>GPT-4o</td>
        <td></td>
        <td>30.0</td>
        <td>7.5</td>
        <td>10.0</td>
        <td>10.0</td>
        <td>10.0</td>
        <td>10.0</td>
        <td>14.0</td>
        <td>15.0</td>
        <td>12.0</td>
        <td>23.0</td>
        <td>141.5(94.3%)</td>
    </tr>
    <tr style="text-align: center;">
        <td>Qwen2-72B</td>
        <td></td>
        <td>30.0</td>
        <td>7.5</td>
        <td>10.0</td>
        <td>10.0</td>
        <td>7.5</td>
        <td>12.5</td>
        <td>14.0</td>
        <td>13.5</td>
        <td>13.5</td>
        <td>20.5</td>
        <td>139.0(92.7%)</td>
    </tr>
    <tr style="text-align: center;">
        <td>WQX-20B</td>
        <td></td>
        <td>30.0</td>
        <td>7.5</td>
        <td>10.0</td>
        <td>10.0</td>
        <td>10.0</td>
        <td>10.0</td>
        <td>15.0</td>
        <td>13.5</td>
        <td>12.5</td>
        <td>20.0</td>
        <td>138.5(92.3%)</td>
    </tr>
    <tr style="text-align: center;">
        <td>Yi-1.5-34B</td>
        <td></td>
        <td>30.0</td>
        <td>7.5</td>
        <td>10.0</td>
        <td>10.0</td>
        <td>7.5</td>
        <td>10.0</td>
        <td>11.0</td>
        <td>13.5</td>
        <td>13.0</td>
        <td>22.0</td>
        <td>134.5(89.7%)</td>
    </tr>
    <tr style="text-align: center;">
        <td>Qwen2-57B</td>
        <td></td>
        <td>30.0</td>
        <td>7.5</td>
        <td>10.0</td>
        <td>7.5</td>
        <td>10.0</td>
        <td>10.0</td>
        <td>9.0</td>
        <td>15.0</td>
        <td>12.0</td>
        <td>15.5</td>
        <td>126.5(84.3%)</td>
    </tr>
    <tr style="text-align: center;">
        <td>Mixtral 8x22B</td>
        <td></td>
        <td>30.0</td>
        <td>7.5</td>
        <td>10.0</td>
        <td>10.0</td>
        <td>10.0</td>
        <td>5.0</td>
        <td>2.0</td>
        <td>9.0</td>
        <td>11.5</td>
        <td>21.5</td>
        <td>116.5(77.7%)</td>
    </tr>
    <tr style="text-align: center;">
        <td>GLM4-9B</td>
        <td></td>
        <td>30.0</td>
        <td>7.5</td>
        <td>10.0</td>
        <td>10.0</td>
        <td>7.5</td>
        <td>0.0</td>
        <td>6.0</td>
        <td>6.0</td>
        <td>5.0</td>
        <td>15.0</td>
        <td>97.0(64.7%)</td>
    </tr>
</table>


## Physics
The scores for each section of the test paper are as follows: 
<table border="1">
<tr style="text-align: center;">
    <th colspan="6" style="text-align: center;">Score Distribution for Each Question Type in Physics</th>
</tr>
<tr style="text-align: center;">
    <td>Model</td><td>Single Choice Questions (Total Score: 30)</td><td>Multiple Choice Questions (Total Score: 18)</td><td>Fill-in-the-Blank Questions (Total Score: 18)</td><td>Short Answer Questions (Total Score: 44)</td><td>Total Score (Maximum: 110)</td>
</tr>
<tr style="text-align: center;"><td>Qwen2-72B</td><td>18</td><td>6</td><td>6</td><td>12</td><td>42</td></tr>
<tr style="text-align: center;"><td>WQX-20B+VL-20B</td><td>18</td><td>12</td><td>9</td><td>0</td><td>39</td></tr>
<tr style="text-align: center;"><td>GPT-4o</td><td>18</td><td>6</td><td>2</td><td>10</td><td>36</td></tr>
<tr style="text-align: center;"><td>Mixtral 8x22B</td><td>12</td><td>3</td><td>5</td><td>5</td><td>25</td></tr>
<tr style="text-align: center;"><td>Yi-1.5-34B+VL-34B</td><td>12</td><td>6</td><td>2</td><td>1</td><td>21</td></tr>
<tr style="text-align: center;"><td>Qwen2-72B+VL-7B</td><td>18</td><td>0</td><td>0</td><td>1</td><td>19</td></tr>
<tr style="text-align: center;"><td>GLM4-9B+4v-9B</td><td>12</td><td>3</td><td>2</td><td>1</td><td>18</td></tr>
<tr style="text-align: center;"><td>Qwen2-57B+VL-7B</td><td>6</td><td>0</td><td>0</td><td>1</td><td>7</td></tr>
</table>

The scores for each question in the test paper are as follows: 
<table border="1">
    <tr style="text-align: center;">
        <th rowspan="2">Physics</th>
        <th rowspan="2">Question number</th>
<th colspan="5">Single Choice Questions</th><th colspan="3">Multiple Choice Questions</th><th colspan="2">Fill-in-the-Blank Questions</th><th colspan="3">Short Answer Questions</th><th rowspan="2">Total Score</th><th rowspan="2">Total Score for Questions with Diagrams</th><th rowspan="2">Total Score for Questions without Diagrams</th></tr>
<tr style="text-align: center;"><th>1*</th><th>2</th><th>3</th><th>4</th><th>5*</th><th>6*</th><th>7*</th><th>8*</th><th>9*</th><th>10*</th><th>11*</th><th>12*</th><th>13*</th></tr><tr style="text-align: center;"><td>Test Model</td><td>Score</td><td>6</td><td>6</td><td>6</td><td>6</td><td>6</td><td>6</td><td>6</td><td>6</td><td>6</td><td>12</td><td>10</td><td>14</td><td>20</td><td>110	(100%)</td><td>92	(84%)</td><td>18	(16%)</td></tr><tr style="text-align: center;"><td>Qwen2-72B</td><td></td><td>0</td><td>6</td><td>6</td><td>6</td><td>0</td><td>0</td><td>0</td><td>6</td><td>6</td><td>0</td><td>1</td><td>3</td><td>8</td><td>42	(38.2%)</td><td>24	(26.1%)</td><td>18	(100%)</td></tr><tr style="text-align: center;"><td>WQX-20B+VL-20B</td><td></td><td>6</td><td>6</td><td>0</td><td>6</td><td>0</td><td>6</td><td>0</td><td>6</td><td>4</td><td>5</td><td>0</td><td>0</td><td>0</td><td>39	(35.5%)</td><td>27	(29.3%)</td><td>12	(66.7%)</td></tr><tr style="text-align: center;"><td>GPT-4o</td><td></td><td>0</td><td>6</td><td>6</td><td>6</td><td>0</td><td>0</td><td>3</td><td>3</td><td>2</td><td>0</td><td>1</td><td>3</td><td>6</td><td>36	(32.7%)</td><td>18	(19.6%)</td><td>18	(100%)</td></tr><tr style="text-align: center;"><td>Mixtral 8x22B</td><td></td><td>0</td><td>6</td><td>0</td><td>6</td><td>0</td><td>0</td><td>3</td><td>0</td><td>4</td><td>1</td><td>0</td><td>1</td><td>4</td><td>25	(22.7%)</td><td>13	(14.1%)</td><td>12	(66.7%)</td></tr><tr style="text-align: center;"><td>Yi-1.5-34B+VL-34B</td><td></td><td>6</td><td>0</td><td>0</td><td>6</td><td>0</td><td>0</td><td>3</td><td>3</td><td>2</td><td>0</td><td>1</td><td>0</td><td>0</td><td>21	(19.1%)</td><td>15	(16.3%)</td><td>6	(33.3%)</td></tr><tr style="text-align: center;"><td>Qwen2-72B+VL-7B</td><td></td><td>0</td><td>6</td><td>6</td><td>6</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>1</td><td>0</td><td>19	(17.3%)</td><td>1	(1.1%)</td><td>18	(100%)</td></tr><tr style="text-align: 
center;"><td>GLM4-9B+4v-9B</td><td></td><td>0</td><td>6</td><td>0</td><td>6</td><td>0</td><td>0</td><td>3</td><td>0</td><td>0</td><td>2</td><td>1</td><td>0</td><td>0</td><td>18	(16.4%)</td><td>6	(6.5%)</td><td>12	(66.7%)</td></tr><tr style="text-align: center;"><td>Qwen2-57B+VL-7B</td><td></td><td>0</td><td>0</td><td>0</td><td>6</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>1</td><td>0</td><td>7	(6.4%)</td><td>1	(1.1%)</td><td>6	(33.3%)</td></tr></table>

An asterisk (\*) after a question number indicates that the question contains images. A model name that includes "+VL" means that questions involving images were answered by the corresponding multimodal version of the model. Without "+VL", only text-based inference was performed and images were ignored.
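
The "with diagrams" and "without diagrams" columns of the Physics per-question table can be reproduced by splitting the per-question marks on the asterisk flag. The sketch below does this for the maximum marks and for the Qwen2-72B row; the list names and the `split_by_image` helper are illustrative assumptions, not part of the original evaluation.

```python
# Maximum marks for Physics questions 1-13, copied from the "Test Model" row.
max_marks = [6, 6, 6, 6, 6, 6, 6, 6, 6, 12, 10, 14, 20]

# True where the question number carries an asterisk (the question has an image).
has_image = [True, False, False, False, True, True, True,
             True, True, True, True, True, True]

# Per-question marks for Qwen2-72B (text only), copied from the same table.
qwen2_72b = [0, 6, 6, 6, 0, 0, 0, 6, 6, 0, 1, 3, 8]


def split_by_image(marks, flags):
    """Return (marks on image questions, marks on text-only questions)."""
    with_image = sum(m for m, f in zip(marks, flags) if f)
    without_image = sum(m for m, f in zip(marks, flags) if not f)
    return with_image, without_image


max_with, max_without = split_by_image(max_marks, has_image)
got_with, got_without = split_by_image(qwen2_72b, has_image)

print(f"with diagrams: {got_with}/{max_with} ({100 * got_with / max_with:.1f}%)")
print(f"without diagrams: {got_without}/{max_without} ({100 * got_without / max_without:.1f}%)")
```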

## Chemistry

The scores for each section of the test paper are as follows: 
<table border="1">
<tr style="text-align: center;">
    <th colspan="4" style="text-align: center;">Score Distribution for Each Question Type in Chemistry</th>
</tr>
<tr style="text-align: center;">
    <td>Model</td><td>Multiple Choice Questions (Total Score: 42)</td><td>Fill-in-the-Blank Questions (Total Score: 58)</td><td>Total Score (Maximum: 100)</td>
</tr>
<tr style="text-align: center;"><td>WQX-20B+VL-20B</td><td>18</td><td>30</td><td>48</td></tr>
<tr style="text-align: center;"><td>Qwen2-72B</td><td>18</td><td>26</td><td>44</td></tr>
<tr style="text-align: center;"><td>GPT-4o</td><td>18</td><td>22</td><td>40</td></tr>
<tr style="text-align: center;"><td>Yi-1.5-34B+VL-34B</td><td>24</td><td>13</td><td>37</td></tr>
<tr style="text-align: center;"><td>Mixtral 8x22B</td><td>18</td><td>17</td><td>35</td></tr>
<tr style="text-align: center;"><td>GLM4-9B+4v-9B</td><td>12</td><td>15</td><td>27</td></tr>
<tr style="text-align: center;"><td>Qwen2-72B+VL-7B</td><td>6</td><td>0</td><td>6</td></tr>
<tr style="text-align: center;"><td>Qwen2-57B+VL-7B</td><td>6</td><td>0</td><td>6</td></tr>
</table>

The scores for each question in the test paper are as follows: 
<table border="1">
    <tr style="text-align: center;">
        <th rowspan="2">Chemistry</th>
        <th rowspan="2">Question number</th>
<th colspan="7">Multiple Choice Questions</th><th colspan="4">Fill-in-the-Blank Questions</th><th rowspan="2">Total Score</th><th rowspan="2">Total Score for Questions with Diagrams</th><th rowspan="2">Total Score for Questions without Diagrams</th></tr>
<tr style="text-align: center;"><th>1</th><th>2*</th><th>3*</th><th>4</th><th>5</th><th>6*</th><th>7*</th><th>8*</th><th>9*</th><th>10*</th><th>11*</th></tr><tr style="text-align: center;"><td>Test Model</td><td>Score</td><td>6</td><td>6</td><td>6</td><td>6</td><td>6</td><td>6</td><td>6</td><td>14</td><td>15</td><td>14</td><td>15</td><td>100	(100%)</td><td>82	(82%)</td><td>18	(18%)</td></tr><tr style="text-align: center;"><td>WQX-20B+VL-20B</td><td></td><td>6</td><td>0</td><td>6</td><td>6</td><td>0</td><td>0</td><td>0</td><td>7</td><td>14</td><td>4</td><td>5</td><td>48	(48%)</td><td>36	(43.9%)</td><td>12	(66.7%)</td></tr><tr style="text-align: center;"><td>Qwen2-72B</td><td></td><td>6</td><td>0</td><td>6</td><td>0</td><td>0</td><td>6</td><td>0</td><td>5</td><td>12</td><td>7</td><td>2</td><td>44	(44%)</td><td>38	(46.3%)</td><td>6	(33.3%)</td></tr><tr style="text-align: center;"><td>GPT-4o</td><td></td><td>6</td><td>0</td><td>6</td><td>0</td><td>0</td><td>6</td><td>0</td><td>5</td><td>8</td><td>8</td><td>1</td><td>40	(40%)</td><td>34	(41.5%)</td><td>6	(33.3%)</td></tr><tr style="text-align: center;"><td>Yi-1.5-34B+VL-34B</td><td></td><td>6</td><td>0</td><td>0</td><td>6</td><td>6</td><td>0</td><td>6</td><td>5</td><td>7</td><td>0</td><td>1</td><td>37	(37%)</td><td>19	(23.2%)</td><td>18	(100%)</td></tr><tr style="text-align: center;"><td>Mixtral 8x22B</td><td></td><td>0</td><td>0</td><td>0</td><td>6</td><td>6</td><td>6</td><td>0</td><td>3</td><td>6</td><td>8</td><td>0</td><td>35	(35%)</td><td>23	(28%)</td><td>12	(66.7%)</td></tr><tr style="text-align: center;"><td>GLM4-9B+4v-9B</td><td></td><td>6</td><td>0</td><td>0</td><td>0</td><td>0</td><td>6</td><td>0</td><td>1</td><td>11</td><td>3</td><td>0</td><td>27	(27%)</td><td>21	(25.6%)</td><td>6	(33.3%)</td></tr><tr style="text-align: center;"><td>Qwen2-72B+VL-7B</td><td></td><td>6</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>6	(6%)</td><td>0	(0%)</td><td>6	
(33.3%)</td></tr><tr style="text-align: center;"><td>Qwen2-57B+VL-7B</td><td></td><td>6</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>6	(6%)</td><td>0	(0%)</td><td>6	(33.3%)</td></tr></table>

Questions marked with an asterisk (\*) contain images. If a model name includes "+VL," questions with images were answered by the corresponding multimodal version of the model; otherwise, inference was text-only and the images were ignored.
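The two rightmost columns of the per-question table can be recomputed from the per-question scores. The sketch below does this for the WQX-20B+VL-20B row of the Chemistry table; the names `QUESTIONS`, `WQX_SCORES`, `subtotal`, and `percent` are illustrative and not part of the original evaluation code, and rounding to one decimal place is an assumption inferred from the percentages shown in the table.

```python
# Per-question data copied from the Chemistry table above.
# Each entry: (question label, full marks, contains an image?)
QUESTIONS = [
    ("1", 6, False), ("2", 6, True), ("3", 6, True), ("4", 6, False),
    ("5", 6, False), ("6", 6, True), ("7", 6, True),
    ("8", 14, True), ("9", 15, True), ("10", 14, True), ("11", 15, True),
]

# Scores of WQX-20B+VL-20B, in question order, from the same table.
WQX_SCORES = [6, 0, 6, 6, 0, 0, 0, 7, 14, 4, 5]


def subtotal(scores, with_image):
    """Return (points earned, points available) for one group of questions."""
    earned = sum(
        score
        for score, (_, _, has_image) in zip(scores, QUESTIONS)
        if has_image == with_image
    )
    available = sum(full for _, full, has_image in QUESTIONS if has_image == with_image)
    return earned, available


def percent(earned, available):
    """Percentage rounded to one decimal place (assumed table convention)."""
    return round(100 * earned / available, 1)


for label, flag in (("with diagrams", True), ("without diagrams", False)):
    earned, available = subtotal(WQX_SCORES, flag)
    print(f"{label}: {earned}/{available} ({percent(earned, available)}%)")
```

Under these assumptions the sketch reproduces the 36 (43.9%) and 12 (66.7%) entries of that row; the same check can be run on any other row by swapping in its scores.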

## Biology

The scores for each section of the test paper are as follows: 
<table border="1">
<tr style="text-align: center;">
    <th colspan="4" style="text-align: center;">Score Distribution for Each Question Type in Biology</th>
</tr>
<tr style="text-align: center;">
    <td>Model</td>
<td>Multiple Choice Questions (Full Score: 36)</td>
<td>Fill-in-the-Blank Questions (Full Score: 54)</td>
<td>Total Score (Full Score: 90)</td></tr><tr style="text-align: center;"><td>GLM4-9B+4v-9B</td><td>36</td><td>31</td><td>67</td></tr><tr style="text-align: center;"><td>GPT-4o</td><td>30</td><td>35</td><td>65</td></tr><tr style="text-align: center;"><td>WQX-20B+VL-20B</td><td>24</td><td>33</td><td>57</td></tr><tr style="text-align: center;"><td>Qwen2-57B+VL-7B</td><td>30</td><td>21</td><td>51</td></tr><tr style="text-align: center;"><td>Yi-1.5-34B+VL-34B</td><td>18</td><td>31</td><td>49</td></tr><tr style="text-align: center;"><td>Qwen2-72B+VL-7B</td><td>30</td><td>18</td><td>48</td></tr><tr style="text-align: center;"><td>Mixtral 8x22B</td><td>6</td><td>40</td><td>46</td></tr></table>The scores for each question in the test paper are as follows:
<table border="1">
    <tr style="text-align: center;">
        <th rowspan="2">Biology</th>
        <th rowspan="2">Question number</th>
<th colspan="6">Multiple Choice Questions</th><th colspan="5">Fill-in-the-Blank Questions</th><th rowspan="2">Total Score</th><th rowspan="2">Total Score for Questions with Diagrams</th><th rowspan="2">Total Score for Questions without Diagrams</th></tr>
<tr style="text-align: center;"><th>1</th><th>2</th><th>3</th><th>4</th><th>5*</th><th>6*</th><th>7*</th><th>8</th><th>9</th><th>10</th><th>11*</th></tr><tr style="text-align: center;"><td>Test Model</td><td>Score</td><td>6</td><td>6</td><td>6</td><td>6</td><td>6</td><td>6</td><td>12</td><td>10</td><td>10</td><td>10</td><td>12</td><td>90	(100%)</td><td>36	(40%)</td><td>54	(60%)</td></tr><tr style="text-align: center;"><td>GLM4-9B+4v-9B</td><td></td><td>6</td><td>6</td><td>6</td><td>6</td><td>6</td><td>6</td><td>7</td><td>10</td><td>8</td><td>0</td><td>6</td><td>67	(74.4%)</td><td>25	(69.4%)</td><td>42	(77.8%)</td></tr><tr style="text-align: center;"><td>GPT-4o</td><td></td><td>6</td><td>6</td><td>6</td><td>6</td><td>0</td><td>6</td><td>6</td><td>10</td><td>8</td><td>5</td><td>6</td><td>65	(72.2%)</td><td>18	(50%)</td><td>47	(87%)</td></tr><tr style="text-align: center;"><td>WQX-20B+VL-20B</td><td></td><td>0</td><td>6</td><td>6</td><td>6</td><td>0</td><td>6</td><td>7</td><td>8</td><td>8</td><td>4</td><td>6</td><td>57	(63.3%)</td><td>19	(52.8%)</td><td>38	(70.4%)</td></tr><tr style="text-align: center;"><td>Qwen2-57B+VL-7B</td><td></td><td>6</td><td>6</td><td>6</td><td>6</td><td>6</td><td>0</td><td>0</td><td>10</td><td>9</td><td>2</td><td>0</td><td>51	(56.7%)</td><td>6	(16.7%)</td><td>45	(83.3%)</td></tr><tr style="text-align: center;"><td>Yi-1.5-34B+VL-34B</td><td></td><td>0</td><td>6</td><td>6</td><td>6</td><td>0</td><td>0</td><td>2</td><td>9</td><td>10</td><td>4</td><td>6</td><td>49	(54.4%)</td><td>8	(22.2%)</td><td>41	(75.9%)</td></tr><tr style="text-align: center;"><td>Qwen2-72B+VL-7B</td><td></td><td>6</td><td>6</td><td>6</td><td>6</td><td>6</td><td>0</td><td>0</td><td>8</td><td>8</td><td>2</td><td>0</td><td>48	(53.3%)</td><td>6	(16.7%)</td><td>42	(77.8%)</td></tr><tr style="text-align: center;"><td>Mixtral 8x22B</td><td></td><td>0</td><td>0</td><td>0</td><td>6</td><td>0</td><td>0</td><td>8</td><td>8</td><td>8</td><td>8</td><td>8</td><td>46	(51.1%)</td><td>16	
(44.4%)</td><td>30	(55.6%)</td></tr></table>

Questions marked with an asterisk (\*) contain images. If a model name includes "+VL," questions with images were answered by the corresponding multimodal version of the model; otherwise, inference was text-only and the images were ignored.

## History

The scores for each section of the test paper are as follows: 

<table border="1">
<tr style="text-align: center;">
    <th colspan="4" style="text-align: center;">Score Distribution for Each Question Type in History</th>
</tr>
<tr style="text-align: center;">
    <td>Model</td>  
<td>Multiple Choice Questions (Full Score: 48)</td><td>Short Answer Questions (Full Score: 52)</td><td>Total Score (Full Score: 100)</td></tr><tr style="text-align: center;"><td>GPT-4o</td><td>44</td><td>44</td><td>88</td></tr><tr style="text-align: center;"><td>Qwen2-72B+VL-7B</td><td>40</td><td>45</td><td>85</td></tr><tr style="text-align: center;"><td>WQX-20B+VL-20B</td><td>44</td><td>38</td><td>82</td></tr><tr style="text-align: center;"><td>GLM4-9B+4v-9B</td><td>48</td><td>32</td><td>80</td></tr><tr style="text-align: center;"><td>Qwen2-57B+VL-7B</td><td>44</td><td>29</td><td>73</td></tr><tr style="text-align: center;"><td>Mixtral 8x22B</td><td>12</td><td>42</td><td>54</td></tr><tr style="text-align: center;"><td>Yi-1.5-34B+VL-34B</td><td>48</td><td>0</td><td>48</td></tr></table>The scores for each question in the test paper are as follows:
<table border="1">
    <tr style="text-align: center;">
        <th rowspan="2">History</th>
        <th rowspan="2">Question number</th>
<th colspan="12">Multiple Choice Questions</th><th colspan="5">Short Answer Questions</th><th rowspan="2">Total Score</th><th rowspan="2">Total Score for Questions with Diagrams</th><th rowspan="2">Total Score for Questions without Diagrams</th></tr>
<tr style="text-align: center;"><th>1</th><th>2</th><th>3</th><th>4*</th><th>5</th><th>6*</th><th>7</th><th>8</th><th>9</th><th>10</th><th>11</th><th>12</th><th>13.1</th><th>13.2</th><th>14</th><th>15.1</th><th>15.2</th></tr><tr style="text-align: center;"><td>Test Model</td><td>Score</td><td>4</td><td>4</td><td>4</td><td>4</td><td>4</td><td>4</td><td>4</td><td>4</td><td>4</td><td>4</td><td>4</td><td>4</td><td>13</td><td>12</td><td>12</td><td>9</td><td>6</td><td>100	(100%)</td><td>8	(8%)</td><td>92	(92%)</td></tr><tr style="text-align: center;"><td>GPT-4o</td><td></td><td>4</td><td>4</td><td>4</td><td>0</td><td>4</td><td>4</td><td>4</td><td>4</td><td>4</td><td>4</td><td>4</td><td>4</td><td>8</td><td>12</td><td>9</td><td>9</td><td>6</td><td>88	(88%)</td><td>4	(50%)</td><td>84	(91.3%)</td></tr><tr style="text-align: center;"><td>Qwen2-72B+VL-7B</td><td></td><td>4</td><td>4</td><td>4</td><td>4</td><td>0</td><td>0</td><td>4</td><td>4</td><td>4</td><td>4</td><td>4</td><td>4</td><td>6</td><td>12</td><td>12</td><td>9</td><td>6</td><td>85	(85%)</td><td>4	(50%)</td><td>81	(88%)</td></tr><tr style="text-align: center;"><td>WQX-20B+VL-20B</td><td></td><td>4</td><td>4</td><td>4</td><td>0</td><td>4</td><td>4</td><td>4</td><td>4</td><td>4</td><td>4</td><td>4</td><td>4</td><td>8</td><td>8</td><td>7</td><td>9</td><td>6</td><td>82	(82%)</td><td>4	(50%)</td><td>78	(84.8%)</td></tr><tr style="text-align: center;"><td>GLM4-9B+4v-9B</td><td></td><td>4</td><td>4</td><td>4</td><td>4</td><td>4</td><td>4</td><td>4</td><td>4</td><td>4</td><td>4</td><td>4</td><td>4</td><td>4</td><td>4</td><td>11</td><td>7</td><td>6</td><td>80	(80%)</td><td>8	(100%)</td><td>72	(78.3%)</td></tr><tr style="text-align: center;"><td>Qwen2-57B+VL-7B</td><td></td><td>4</td><td>4</td><td>4</td><td>4</td><td>4</td><td>0</td><td>4</td><td>4</td><td>4</td><td>4</td><td>4</td><td>4</td><td>6</td><td>12</td><td>8</td><td>3</td><td>0</td><td>73	(73%)</td><td>4	(50%)</td><td>69	(75%)</td></tr><tr style="text-align: 
center;"><td>Mixtral 8x22B</td><td></td><td>4</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>4</td><td>4</td><td>7</td><td>12</td><td>12</td><td>7</td><td>4</td><td>54	(54%)</td><td>0	(0%)</td><td>54	(58.7%)</td></tr><tr style="text-align: center;"><td>Yi-1.5-34B+VL-34B</td><td></td><td>4</td><td>4</td><td>4</td><td>4</td><td>4</td><td>4</td><td>4</td><td>4</td><td>4</td><td>4</td><td>4</td><td>4</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>48	(48%)</td><td>8	(100%)</td><td>40	(43.5%)</td></tr></table>

Questions marked with an asterisk (\*) contain images. If a model name includes "+VL," questions with images were answered by the corresponding multimodal version of the model; otherwise, inference was text-only and the images were ignored.

## Geography

The scores for each section of the test paper are as follows: 

<table border="1">
<tr style="text-align: center;">
    <th colspan="4" style="text-align: center;">Score Distribution for Each Question Type in Geography</th>
</tr>
<tr style="text-align: center;">
    <td>Model</td>
    <td>Multiple Choice Questions (Full Score: 44)</td>
    <td>Short Answer Questions (Full Score: 56)</td>
    <td>Total Score (Full Score: 100)</td></tr><tr style="text-align: center;"><td>Qwen2-72B</td><td>32</td><td>38</td><td>70</td></tr><tr style="text-align: center;"><td>GLM4-9B+4v-9B</td><td>36</td><td>26</td><td>62</td></tr><tr style="text-align: center;"><td>GPT-4o</td><td>32</td><td>27</td><td>59</td></tr><tr style="text-align: center;"><td>WQX-20B+VL-20B</td><td>32</td><td>26</td><td>58</td></tr><tr style="text-align: center;"><td>Mixtral 8x22B</td><td>24</td><td>32</td><td>56</td></tr><tr style="text-align: center;"><td>Yi-1.5-34B+VL-34B</td><td>20</td><td>21</td><td>41</td></tr><tr style="text-align: center;"><td>Qwen2-72B+VL-7B</td><td>4</td><td>0</td><td>4</td></tr><tr style="text-align: center;"><td>Qwen2-57B+VL-7B</td><td>4</td><td>0</td><td>4</td></tr></table>The scores for each question in the test paper are as follows: 
<table border="1">
    <tr style="text-align: center;">
        <th rowspan="2">Geography</th>
        <th rowspan="2">Question number</th>
		<th colspan="11">Multiple Choice Questions</th>
        <th colspan="9">Short Answer Questions</th>
        <th rowspan="2">Total Score</th>
        <th rowspan="2">Total Score for Questions with Diagrams</th>
        <th rowspan="2">Total Score for Questions without Diagrams</th></tr>
<tr style="text-align: center;"><th>1.1*</th><th>1.2*</th><th>1.3*</th><th>2.1*</th><th>2.2*</th><th>3.1*</th><th>3.2*</th><th>3.3*</th><th>4.1*</th><th>4.2*</th><th>4.3*</th><th>5.1*</th><th>5.2*</th><th>5.3*</th><th>5.4*</th><th>6.1*</th><th>6.2*</th><th>6.3*</th><th>6.4*</th><th>6.5*</th></tr><tr style="text-align: center;"><td>Test Model</td><td>Score</td><td>4</td><td>4</td><td>4</td><td>4</td><td>4</td><td>4</td><td>4</td><td>4</td><td>4</td><td>4</td><td>4</td><td>6</td><td>6</td><td>8</td><td>8</td><td>6</td><td>6</td><td>4</td><td>6</td><td>6</td><td>100	(100%)</td><td>100	(100%)</td><td>0	(0%)</td></tr><tr style="text-align: center;"><td>Qwen2-72B</td><td></td><td>4</td><td>4</td><td>4</td><td>0</td><td>4</td><td>0</td><td>4</td><td>0</td><td>4</td><td>4</td><td>4</td><td>2</td><td>4</td><td>8</td><td>8</td><td>4</td><td>6</td><td>0</td><td>2</td><td>4</td><td>70	(70%)</td><td>70	(70%)</td><td>0	(0%)</td></tr><tr style="text-align: center;"><td>GLM4-9B+4v-9B</td><td></td><td>4</td><td>4</td><td>4</td><td>0</td><td>4</td><td>0</td><td>4</td><td>4</td><td>4</td><td>4</td><td>4</td><td>3</td><td>2</td><td>8</td><td>6</td><td>2</td><td>1</td><td>2</td><td>2</td><td>0</td><td>62	(62%)</td><td>62	(62%)</td><td>0	(0%)</td></tr><tr style="text-align: center;"><td>GPT-4o</td><td></td><td>4</td><td>4</td><td>4</td><td>0</td><td>4</td><td>0</td><td>4</td><td>0</td><td>4</td><td>4</td><td>4</td><td>0</td><td>4</td><td>6</td><td>4</td><td>4</td><td>3</td><td>2</td><td>0</td><td>4</td><td>59	(59%)</td><td>59	(59%)</td><td>0	(0%)</td></tr><tr style="text-align: center;"><td>WQX-20B+VL-20B</td><td></td><td>4</td><td>4</td><td>4</td><td>0</td><td>4</td><td>0</td><td>4</td><td>4</td><td>4</td><td>0</td><td>4</td><td>2</td><td>4</td><td>0</td><td>4</td><td>6</td><td>6</td><td>2</td><td>0</td><td>2</td><td>58	(58%)</td><td>58	(58%)</td><td>0	(0%)</td></tr><tr style="text-align: center;"><td>Mixtral 
8x22B</td><td></td><td>4</td><td>4</td><td>4</td><td>0</td><td>4</td><td>0</td><td>0</td><td>4</td><td>4</td><td>0</td><td>0</td><td>2</td><td>4</td><td>6</td><td>2</td><td>6</td><td>6</td><td>2</td><td>4</td><td>0</td><td>56	(56%)</td><td>56	(56%)</td><td>0	(0%)</td></tr><tr style="text-align: center;"><td>Yi-1.5-34B+VL-34B</td><td></td><td>4</td><td>0</td><td>0</td><td>0</td><td>4</td><td>0</td><td>0</td><td>4</td><td>0</td><td>4</td><td>4</td><td>2</td><td>6</td><td>2</td><td>2</td><td>4</td><td>3</td><td>0</td><td>0</td><td>2</td><td>41	(41%)</td><td>41	(41%)</td><td>0	(0%)</td></tr><tr style="text-align: center;"><td>Qwen2-72B+VL-7B</td><td></td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>4</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>4	(4%)</td><td>4	(4%)</td><td>0	(0%)</td></tr><tr style="text-align: center;"><td>Qwen2-57B+VL-7B</td><td></td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>4</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>4	(4%)</td><td>4	(4%)</td><td>0	(0%)</td></tr></table>

Questions marked with an asterisk (\*) contain images. If a model name includes "+VL," questions with images were answered by the corresponding multimodal version of the model; otherwise, inference was text-only and the images were ignored.

## Politics

The scores for each section of the test paper are as follows: 
<table border="1">
<tr style="text-align: center;">
    <th colspan="4" style="text-align: center;">Score Distribution for Each Question Type in Politics</th>
</tr>
<tr style="text-align: center;">
    <td>Model</td>
<td>Multiple Choice Questions (Full Score: 48)</td><td>Short Answer Questions (Full Score: 52)</td><td>Total Score (Full Score: 100)</td></tr><tr style="text-align: center;"><td>WQX-20B+VL-20B</td><td>40</td><td>27</td><td>67</td></tr><tr style="text-align: center;"><td>Qwen2-57B+VL-7B</td><td>36</td><td>26</td><td>62</td></tr><tr style="text-align: center;"><td>Qwen2-72B+VL-7B</td><td>40</td><td>20</td><td>60</td></tr><tr style="text-align: center;"><td>GPT-4o</td><td>44</td><td>14</td><td>58</td></tr><tr style="text-align: center;"><td>Yi-1.5-34B+VL-34B</td><td>36</td><td>15</td><td>51</td></tr><tr style="text-align: center;"><td>GLM4-9B+4v-9B</td><td>36</td><td>12</td><td>48</td></tr><tr style="text-align: center;"><td>Mixtral 8x22B</td><td>28</td><td>10</td><td>38</td></tr></table>The scores for each question in the test paper are as follows: 
<table border="1">
    <tr style="text-align: center;">
        <th rowspan="2">Politics</th>
        <th rowspan="2">Question number</th>
<th colspan="12">Multiple Choice Questions</th><th colspan="5">Short Answer Questions</th><th rowspan="2">Total Score</th>
        <th rowspan="2">Total Score for Questions with Diagrams</th>
        <th rowspan="2">Total Score for Questions without Diagrams</th></tr>
<tr style="text-align: center;"><th>1</th><th>2</th><th>3</th><th>4</th><th>5</th><th>6</th><th>7</th><th>8</th><th>9</th><th>10</th><th>11*</th><th>12</th><th>13</th><th>14.1</th><th>14.2</th><th>15</th><th>16</th></tr><tr style="text-align: center;"><td>Test Model</td><td>Score</td><td>4</td><td>4</td><td>4</td><td>4</td><td>4</td><td>4</td><td>4</td><td>4</td><td>4</td><td>4</td><td>4</td><td>4</td><td>12</td><td>10</td><td>10</td><td>10</td><td>10</td><td>100	(100%)</td><td>4	(4%)</td><td>96	(96%)</td></tr><tr style="text-align: center;"><td>WQX-20B+VL-20B</td><td></td><td>4</td><td>4</td><td>4</td><td>4</td><td>4</td><td>4</td><td>4</td><td>4</td><td>4</td><td>0</td><td>0</td><td>4</td><td>4</td><td>6</td><td>6</td><td>6</td><td>5</td><td>67	(67%)</td><td>0	(0%)</td><td>67	(69.8%)</td></tr><tr style="text-align: center;"><td>Qwen2-57B+VL-7B</td><td></td><td>4</td><td>4</td><td>4</td><td>4</td><td>4</td><td>4</td><td>4</td><td>4</td><td>4</td><td>0</td><td>0</td><td>0</td><td>2</td><td>6</td><td>9</td><td>5</td><td>4</td><td>62	(62%)</td><td>0	(0%)</td><td>62	(64.6%)</td></tr><tr style="text-align: center;"><td>Qwen2-72B+VL-7B</td><td></td><td>4</td><td>4</td><td>4</td><td>4</td><td>4</td><td>4</td><td>4</td><td>4</td><td>4</td><td>0</td><td>0</td><td>4</td><td>1</td><td>3</td><td>6</td><td>5</td><td>5</td><td>60	(60%)</td><td>0	(0%)</td><td>60	(62.5%)</td></tr><tr style="text-align: center;"><td>GPT-4o</td><td></td><td>4</td><td>4</td><td>4</td><td>4</td><td>4</td><td>4</td><td>4</td><td>4</td><td>4</td><td>0</td><td>4</td><td>4</td><td>0</td><td>2</td><td>1</td><td>5</td><td>6</td><td>58	(58%)</td><td>4	(100%)</td><td>54	(56.2%)</td></tr><tr style="text-align: center;"><td>Yi-1.5-34B+VL-34B</td><td></td><td>4</td><td>0</td><td>4</td><td>4</td><td>4</td><td>4</td><td>0</td><td>4</td><td>4</td><td>0</td><td>4</td><td>4</td><td>3</td><td>4</td><td>6</td><td>0</td><td>2</td><td>51	(51%)</td><td>4	(100%)</td><td>47	(49%)</td></tr><tr style="text-align: 
center;"><td>GLM4-9B+4v-9B</td><td></td><td>4</td><td>4</td><td>4</td><td>0</td><td>4</td><td>4</td><td>4</td><td>4</td><td>4</td><td>0</td><td>0</td><td>4</td><td>2</td><td>3</td><td>0</td><td>5</td><td>2</td><td>48	(48%)</td><td>0	(0%)</td><td>48	(50%)</td></tr><tr style="text-align: center;"><td>Mixtral 8x22B</td><td></td><td>4</td><td>4</td><td>4</td><td>0</td><td>4</td><td>4</td><td>0</td><td>4</td><td>4</td><td>0</td><td>0</td><td>0</td><td>2</td><td>3</td><td>0</td><td>3</td><td>2</td><td>38	(38%)</td><td>0	(0%)</td><td>38	(39.6%)</td></tr></table>

Questions marked with an asterisk (\*) contain images. If a model name includes "+VL," questions with images were answered by the corresponding multimodal version of the model; otherwise, inference was text-only and the images were ignored.
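The routing rule in the table legend can be sketched as a small dispatch function. This is a minimal illustration, not the evaluation's actual code: `split_model_name` and `choose_backend` are hypothetical helpers, and the sketch assumes that any suffix after "+" in a model name (including the "+4v-9B" suffix of GLM4-9B) names the multimodal variant.

```python
def split_model_name(name):
    """Split a table model name into (text_model, vision_suffix or None).

    "Qwen2-72B+VL-7B" -> ("Qwen2-72B", "VL-7B")
    "Qwen2-72B"       -> ("Qwen2-72B", None)
    """
    text_model, sep, vision = name.partition("+")
    return text_model, (vision if sep else None)


def choose_backend(name, question_has_image):
    """Pick which variant answers a question, following the table legend.

    Assumption: the part after "+" identifies the multimodal variant.
    """
    text_model, vision = split_model_name(name)
    if question_has_image and vision is not None:
        # Image question and a multimodal variant exists: use it.
        return f"multimodal:{vision}"
    # Text-only question, or no multimodal variant: text inference,
    # with any image in the question ignored.
    return f"text:{text_model}"


print(choose_backend("Qwen2-72B+VL-7B", True))
print(choose_backend("Qwen2-72B+VL-7B", False))
print(choose_backend("Mixtral 8x22B", True))
```

A text-only entry such as Qwen2-72B or Mixtral 8x22B always falls through to the text branch, which matches how those rows are described in the legend.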

## Overall Feedback from Teachers

After grading all subjects, we informed the teachers that the answers to the above examination papers were generated by large models. We then invited the grading teachers to provide feedback on the overall performance of the seven large models.

**Comments from the Chinese Teacher**:  
The large models perform adequately in translating classical Chinese texts; however, they largely fail on subjective questions: they struggle to understand the prompts and often misidentify what certain pronouns refer to, which leads to irrelevant answers. The essays generated by the models do not resemble typical Gaokao essays and read more like Q&A responses. Although they stay on topic, they lack embellishment. Human candidates typically use examples and citations, often quoting famous individuals and drawing on material about notable figures, but the models rarely do so. When asked to write a metaphorical sentence, the models almost universally failed, confusing the tenor and the vehicle, which indicates that they do not understand the "metaphor" technique or what a "tenor" is. They also struggled with sentence completion tasks, finding it difficult to stay coherent with the context and to follow certain conventions of written Chinese. For instance, if a new concept such as "sleep quality" appears later in the text, it should already be introduced in the completed sentence; otherwise, its sudden appearance feels abrupt and disconnected. The models also struggle to grasp subtleties in language.

**Comments from the Mathematics Teacher**:  
The models generally take a mechanical approach to problem-solving, with most completing tasks through enumeration rather than reasoning. For example, in the first problem, the models could not handle the second sub-question: they rigidly picked a formula without analyzing how to derive the conclusion from the given information, as a human candidate would. The models are also better at analytic geometry than at solid geometry, where their reasoning is sometimes absurd. They recall formulas reasonably well but struggle to apply them flexibly during problem-solving. Some solutions look correct at first glance but lack a sound logical flow, which makes grading difficult.

**Comments from the English Teacher**:  
The language is impeccable; however, some models fail to recognize what a question requires, such as treating a "choose five" question as a single-choice question or reading four questions as five. The models' strengths and weaknesses show most clearly in the continuation writing task. First, the prompt requires a total of 150 words, but several models interpreted this as 150 words per paragraph, producing more than 300 words in total. Second, in understanding the writing prompt, some models produced plots that completely diverged from the examiner's intent, with some even writing "magical realism" scenarios, such as lending money that is never repaid and taxi drivers holding insider information from German banks, which are highly unrealistic, movie-like plots. This highlights the difference between models and real examinees: candidates generally have weaker language skills but stronger comprehension, which leads to reasonably constructed plots, while the models write flawless language but express ideas that do not align with common sense and human emotions.

**Comments from the Politics Teacher**:  
Overall, the models had a low accuracy rate for multiple-choice questions, and their responses to short answer questions were overly mechanical. Particularly concerning was the first short answer question about the principal body of the National People's Congress, which none of the models answered correctly, failing to reference textbook knowledge. They did not integrate textbook concepts, mechanically repeating material without connecting to theoretical knowledge. Additionally, the models struggled to accurately interpret questions, which is a common issue across all models. They often failed to analyze the angle of the questions, such as identifying whether they were asking about significance, reasons, or measures, resulting in insufficiently standardized responses. The only questions that scored relatively well were those requiring mechanical knowledge, such as the question on dialectical thinking, which scored well because it fell within a narrow knowledge range and was considered an easy question in the exam.

**Comments from the History Teacher**:  
For multiple-choice questions, there are instances where two answers are given.  
For subjective questions: 1. Key points are incomplete, failing to analyze the prompts and materials from political, economic, and cultural perspectives; 2. There is a tendency to copy materials verbatim; 3. The language is verbose, with excessive word counts lacking conciseness, failing to provide a high-level summary of the materials, especially in the short essay section, which often consists of mere listings of historical facts without analysis; 4. There is a lack of understanding of the prompts, resulting in answers that do not address the questions, such as responding with effects when asked for causes. Thus, compared to humans, the main issue with the models is a lack of understanding of answering techniques and standards.

**Comments from the Geography Teacher**:  
Among the seven examination papers, two contained garbled text, specifically Geography C and Geography E, while the other five papers were assessable.  
Regarding multiple-choice questions: Among the five assessable papers, A and D scored low on multiple-choice questions, while the others performed well.  
Regarding subjective questions: The overall score rate was low, and some answers were irrelevant. The models' answers differ significantly from human answers in three respects: first, the answers are not organized into clear points, and paragraph-style writing obscures the key scoring points; second, a single scoring point is described redundantly, often in lengthy narratives that earn few points for their length; third, the answers lack geographical logic, merely describing results without analyzing cause and effect.

**Comments from the Physics Teacher**:  
In the physics section of the new curriculum standard exam, the models had a decent accuracy rate for multiple-choice questions, but there is room for improvement in experimental sections and larger questions. Their derivation steps are more comprehensive and complete than those of students, but they can only engage in surface-level thinking, failing to correctly address extension or transformation questions related to knowledge points.  
In the choice-question section, the models sometimes select all four options or judge all four options unreasonable; even when they recognize that a question is single-choice, they still stick to their own derivation. In multiple-choice questions, they point out which of the two selected options would be more reasonable if the question were single-choice, which is unnecessary work. There are also cases where they derive at length without arriving at a definitive answer. Overall, the models are less reliable than students on choice questions, although their derivation ability is commendable.  
The models' responses to experimental questions were entirely incorrect, whether the task involved ordering steps, switching ranges, selecting instruments, or reading data, or the slightly more discriminating questions about connecting the red and black test leads or expressing the internal resistance of a voltmeter. This shows that the models can identify only the most basic knowledge points and cannot handle extended or transformed situations.  
In Question 11, the models would perform two orthogonal decompositions for the tension in two ropes, with no numerical relationship between them. In contrast, students would identify a research object and perform a single orthogonal decomposition, using mathematical equivalences to solve the problem. The models clearly cannot achieve this and can only make judgments based on literal meanings.  
In Question 12, the models state that they can calculate the accurate answer but do not perform the calculation themselves; even in the answer section, they provide only the thought process.  
In Question 13, the calculations for the first two physical quantities are basically correct, but the third quantity causes difficulties, and the derivation heads in the wrong direction.

**Comments from the Chemistry Teacher**:  
Overall, the models use very rigid language when answering questions and lack a sense of logic. Sometimes one sentence seems relevant, but the next fails to follow from it, which makes the answer hard to assess even though the initial line of thought appears correct. Apart from the third and seventh sets of time-limited papers (which are essentially the same and lack logical coherence, with all subjective questions garbled), the other five sets have commendable aspects, especially in the multiple-choice questions. For instance, in the second, fourth, and sixth sets, although the accuracy rate is not high, each option is analyzed, and the language closely resembles human analytical language. The incorrect answers may be due to issues with data integration. On subjective questions, the fifth set significantly surpasses the other models, particularly in Questions 8 and 9: although it gives no explanation or analysis, its answers are very concise and correspond directly to the questions. However, since these are generated by large models, the overall accuracy rate is still around 30%, though there are early signs of human-like responses. The difference from real examinees is that candidates generally have weaker language skills but stronger comprehension, so their responses tend to be on point, whereas the models' language is often verbose, lacks logical coherence, and fails to identify the key points of an answer. Therefore, improving the language logic related to big data would be beneficial.

**Comments from the Biology Teacher**:  
The models generally have a decent accuracy rate, but for some questions, such as multiple-choice questions, the answers may be correct while the explanations for certain options are inaccurate or even erroneous.  
The large models sometimes produce garbled text or no answers at all.  
For biology experimental design questions, the models appear somewhat rigid in their thinking. However, their responses to basic knowledge questions are relatively good.

## Answer Examples

- Chinese

Question:
```
宿千岁庵听泉  
刘克庄  
因爱庵前一脉泉,襆衾来此借房眠。  
骤闻将谓溪当户,久听翻疑屋是船。  
变作怒声犹壮伟,滴成细点更清圆。  
君看昔日兰亭帖,亦把湍流替管弦。  
 
诗中以对比的方式,对泉声进行了生动的描写,请结合相关内容简要赏析。(6分)
```
Output (Model: Qwen2-72B):
```
【小题1】选项D的解读有误。诗人提到“君看昔日兰亭帖,亦把湍流替管弦”，并非意味着他和兰亭诸贤一样摒弃了乐器，而是用典故表达了对自然之音的欣赏，认为自然界的流水声可以替代乐器发出的音乐，达到同样的审美效果。因此，选项D的表述“都把对音乐之美的追求寄托于山水而摒弃了乐器”并不准确。
【小题2】诗中通过对比的方式，对泉声进行了生动的描绘，具体体现在以下几个方面：
 1. 骤闻与久听的对比：“骤闻将谓溪当户,久听翻疑屋是船。”这句诗通过“骤闻”与“久听”的对比，展现了泉声给人带来的不同感受。初听时，诗人以为是溪水就在门口流淌，声音宏大；而长时间倾听后，却产生了错觉，仿佛自己所处的屋子就像是一艘船，在水上漂浮。这种对比不仅表现了泉声的多变性，也体现了诗人对泉声由外在感知转向内在体验的过程。
 2. 怒声与细点的对比：“变作怒声犹壮伟,滴成细点更清圆。”这里通过“怒声”与“细点”的对比，展现了泉声从强烈到细腻的变化。起初，泉声如怒涛般壮阔，给人以震撼；随后，细小的水滴落下，声音变得清脆圆润，如同珍珠落盘，给人以宁静之美。这种对比手法，使读者能够更加直观地感受到泉声的丰富层次和变化。
 3. 自然与艺术的对比：“君看昔日兰亭帖,亦把湍流替管弦。”诗人引用兰亭雅集的典故，将自然界的湍流声与古代文人雅士聚会时的管弦乐声进行对比，暗示自然之声同样具有艺术美感，甚至可以超越人为的音乐。这种对比不仅提升了泉声的艺术价值，也反映了诗人崇尚自然、追求心灵与自然和谐共融的生活态度。
 综上所述，诗中通过对比的手法，不仅生动地描绘了泉声的多样性和美感，还表达了诗人对自然之美的深刻感悟和对艺术境界的追求。
```
Teacher's comments:
> The answer gives a clear, well-organized analysis of the poem's use of contrast from several angles, including "sudden hearing" (骤闻) versus "prolonged listening" (久听) and the "angry voice" (怒声) versus the "fine drops" (细点). It shows good comprehension and analytical ability with respect to the poem.

_____
- Mathematics

Question:
```latex
15. 记  $\triangle A B C$ 的内角 $A, B, C$ 的对边分别为 $a, b, c$, 已知 $\sin C=\sqrt{2} \cos B, a^2+b^2-c^2=\sqrt{2} a b$
(1) 求 $B$;
(2) 若  $\triangle A B C$  的面积 $3+\sqrt{3}$, 求 $c$。
```
Output (Model: WQX-20B):
```latex
首先，我们有两个关于三角形$ABC$的条件：

1. $\sin C = \sqrt{2} \cos B$
2. $a^2 + b^2 - c^2 = \sqrt{2} ab$

从第二个条件出发，我们可以使用余弦定理，它表述为 $c^2 = a^2 + b^2 - 2ab\cos C$。将给定的等式与余弦定理比较，我们得到：

$\cos C = \frac{\sqrt{2} ab}{2ab} = \frac{\sqrt{2}}{2}$

由于$C$是三角形的内角，其范围是$(0, \pi)$，所以$C$的值是：

$C = \frac{\pi}{4}$

接下来，我们使用第一个条件和三角恒等式来找到$B$的值。由于$\sin C = \sin(\frac{\pi}{4}) = \frac{\sqrt{2}}{2}$，我们有：

$\sin(\frac{\pi}{4}) = \sqrt{2} \cos B$

这可以简化为：

$\cos B = \frac{\sin(\frac{\pi}{4})}{\sqrt{2}} = \frac{1}{2}$

由于$B$也是三角形的内角，其范围是$(0, \pi)$，所以$B$的值是：

$B = \frac{\pi}{3}$

现在，我们已知$B = \frac{\pi}{3}$，$C = \frac{\pi}{4}$，我们可以找到$A$的值，因为三角形内角和为$\pi$：

$A = \pi - B - C = \pi - \frac{\pi}{3} - \frac{\pi}{4} = \frac{5\pi}{12}$

接下来，我们使用正弦定理来找到边$c$的长度。正弦定理表述为：

$\frac{a}{\sin A} = \frac{b}{\sin B} = \frac{c}{\sin C}$

我们已知三角形的面积$S$为$3 + \sqrt{3}$，面积公式为：

$S = \frac{1}{2} ab\sin C$

代入已知的$\sin C$和面积，我们得到：

$3 + \sqrt{3} = \frac{1}{2} ab \cdot \frac{\sqrt{2}}{2}$

解这个方程得到$ab$的值：

$ab = 2\sqrt{2}(3 + \sqrt{3})$

现在，我们使用正弦定理来找到$c$：

$c = \frac{a\sin C}{\sin A}$

由于$\sin A = \sin(\frac{5\pi}{12})$，我们可以使用和角公式来找到$\sin A$：

$\sin A = \sin(\frac{\pi}{4} + \frac{\pi}{6}) = \sin\frac{\pi}{4}\cos\frac{\pi}{6} + \cos\frac{\pi}{4}\sin\frac{\pi}{6} = \frac{\sqrt{2}}{2} \cdot \frac{\sqrt{3}}{2} + \frac{\sqrt{2}}{2} \cdot \frac{1}{2} = \frac{\sqrt{6} + \sqrt{2}}{4}$

现在我们可以计算$c$：

$c = \frac{a\sin C}{\sin A} = \frac{a \cdot \frac{\sqrt{2}}{2}}{\frac{\sqrt{6} + \sqrt{2}}{4}} = \frac{2a}{\sqrt{6} + \sqrt{2}}$

由于$ab = 2\sqrt{2}(3 + \sqrt{3})$，我们可以表示$a$为$b$的函数：

$a = \frac{2\sqrt{2}(3 + \sqrt{3})}{b}$

将$a$代入$c$的表达式中，我们得到：

$c = \frac{2 \cdot \frac{2\sqrt{2}(3 + \sqrt{3})}{b} \cdot \frac{\sqrt{2}}{2}}{\sqrt{6} + \sqrt{2}} = \frac{2(3 + \sqrt{3})}{\sqrt{6} + \sqrt{2}}$

为了简化$c$，我们可以乘以共轭分母：

$c = \frac{2(3 + \sqrt{3})(\sqrt{6} - \sqrt{2})}{(\sqrt{6} + \sqrt{2})(\sqrt{6} - \sqrt{2})} = \frac{2(3\sqrt{6} - 3\sqrt{2} + 3\sqrt{2} - 2\sqrt{3})}{6 - 2}$

$c = \frac{2(3\sqrt{6} - 2\sqrt{3})}{4} = \frac{3\sqrt{6} - 2\sqrt{3}}{2}$

所以，$B = \frac{\pi}{3}$，$c = \frac{3\sqrt{6} - 2\sqrt{3}}{2}$。
```
Teacher's comments:
> This question tests the law of cosines and the law of sines. The model correctly applies the law of cosines to solve part (1) and makes good use of the angle-addition formula for sine to simplify part (2). It is the only model among all models that correctly calculates $\sin A$.

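The quantities in this problem can be checked numerically. The sketch below is not part of the original evaluation: it assumes the angle values derived in the model output above ($C = \frac{\pi}{4}$, $\cos B = \frac{1}{2}$), and the expression used for $c$ is our own rearrangement of the area formula via the law of sines. Variable names such as `sin_a_closed_form` are purely illustrative.

```python
import math

# Angles as derived in the model output above.
C = math.pi / 4                             # from cos C = sqrt(2)/2
B = math.acos(math.sin(C) / math.sqrt(2))   # from sin C = sqrt(2) * cos B
A = math.pi - B - C                         # angles of a triangle sum to pi

# Closed form singled out in the teacher's comment.
sin_a_closed_form = (math.sqrt(6) + math.sqrt(2)) / 4

# Assumption (our own rearrangement, not from the source):
# S = (1/2) a b sin C with a = c sin A / sin C and b = c sin B / sin C
# gives S = c^2 sin A sin B / (2 sin C); solve for c.
S = 3 + math.sqrt(3)
c = math.sqrt(2 * S * math.sin(C) / (math.sin(A) * math.sin(B)))

print(f"B = {B:.6f} rad, A = {A:.6f} rad")
print(f"sin A = {math.sin(A):.6f}, closed form = {sin_a_closed_form:.6f}")
print(f"c = {c:.6f}")
```

Under these assumptions the numeric value of $\sin A$ agrees with the closed form $\frac{\sqrt{6}+\sqrt{2}}{4}$, and the script gives $c = 2\sqrt{2} \approx 2.828$, which can be compared against the final line of the model output.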
_____

- English

Question:
```
五、书信写作
66．假定你是李华，上周五你们班在公园上了一堂美术课。请你给英国朋友Chris写一封邮件分享这次经历，内容包括：
（1）你完成的作品；
（2）你的感想。
注意：
（1）写作词数应为80个左右；
（2）请按如下格式在答题纸的相应位置作答。
Dear Chris,
I’m writing to share with you an art class I had in a park last Friday. 
_____________________________________                          _____________________________________  

Yours,
Li Hua
```
Output (Model: GPT-4o):
```
I'm writing to share with you an art class I had in a park last Friday. We spent the day painting landscapes, and I completed a piece featuring a serene lake surrounded by trees and flowers. The experience was incredibly refreshing and inspiring. It felt wonderful to be outdoors, capturing the beauty of nature on canvas. I realized how much I enjoy painting and how it helps me relax and express my creativity. I hope we can have a similar experience together someday.
```
Teacher's comments:
> Varied sentence patterns and flawless language; the required content is fully covered. The word count is slightly low, so 1 point is deducted at the grader's discretion.

